In this case study I worked for Cyclistic, a fictional bike-share company based in Chicago. I followed the process from the Google Data Analytics course: Ask, Prepare, Process, Analyse, Share and Act. The goal was to identify how annual members and casual riders use Cyclistic bikes differently, enabling the business to design marketing strategies accordingly.
My analysis will consist of:

- analysis of temporal daily, weekly and monthly journey patterns relating to casual users, with comparison to subscribers where appropriate
- analysis of journey durations in conjunction with the temporal patterns
- insight into any geographical patterns
The historical trip data relates to Cyclistic Bike-Share, a fictional company. The original data is publicly available and downloaded from Divvy, and is made available by Motivate International Inc. under licence here.
As a newcomer to data analytics I decided to attempt the whole exercise in R to enhance my skills. To avoid errors due to available memory in RStudio, I followed the course guidelines and used the quarterly data instead of the more recent monthly data. This consisted of 4 files with up to 15 columns each, only 8 of which existed in all 4 files. The data is comprehensive: after cleaning it contained 3.76 million rows, each row relating to a single trip.
I checked the reliability of the data for missing values and anomalies, then carried out a number of steps to re-organise, standardise and clean it. The steps are outlined below:
R code to install packages (only if needed) and load the libraries

```r
# Check and install packages only if needed
packages <- c("dplyr", "skimr", "janitor", "here", "lubridate", "ggplot2", "knitr")
new_packages <- packages[!(packages %in% installed.packages()[, "Package"])]
if (length(new_packages)) install.packages(new_packages)

library(dplyr)
library(skimr)
library(janitor)
library(here)
library(lubridate)
library(ggplot2)
library(knitr)
```
R code to import the 4 quarterly data files

```r
Q1 <- read.csv("~/Desktop/Cyclistic Files/Divvy_Trips_2020_Q1.csv")
Q2 <- read.csv("~/Desktop/Cyclistic Files/Divvy_Trips_2019_Q2.csv")
Q3 <- read.csv("~/Desktop/Cyclistic Files/Divvy_Trips_2019_Q3.csv")
Q4 <- read.csv("~/Desktop/Cyclistic Files/Divvy_Trips_2019_Q4.csv")
```
R code to rename and standardise column names

```r
Q1 <- rename(Q1,
             trip_id = ride_id,
             start_time_2 = started_at,
             end_time_2 = ended_at,
             usertype = member_casual)

Q4 <- rename(Q4,
             start_station_id = from_station_id,
             start_station_name = from_station_name,
             end_station_id = to_station_id,
             end_station_name = to_station_name,
             end_time_2 = end_time,
             start_time_2 = start_time)

Q3 <- rename(Q3,
             start_station_id = from_station_id,
             start_station_name = from_station_name,
             end_station_id = to_station_id,
             end_station_name = to_station_name,
             end_time_2 = end_time,
             start_time_2 = start_time)

Q2 <- rename(Q2,
             trip_id = X01...Rental.Details.Rental.ID,
             start_time_2 = X01...Rental.Details.Local.Start.Time,
             end_time_2 = X01...Rental.Details.Local.End.Time,
             bike_id = X01...Rental.Details.Bike.ID,
             trip_duration = X01...Rental.Details.Duration.In.Seconds.Uncapped,
             start_station_id = X03...Rental.Start.Station.ID,
             start_station_name = X03...Rental.Start.Station.Name,
             end_station_id = X02...Rental.End.Station.ID,
             end_station_name = X02...Rental.End.Station.Name,
             Member.Birth_Year = X05...Member.Details.Member.Birthday.Year,
             usertype = User.Type)
```
R code to convert dates from chr to datetime

```r
Q1$start_time <- as_datetime(Q1$start_time_2)
Q1$end_time <- as_datetime(Q1$end_time_2)
Q2$start_time <- as_datetime(Q2$start_time_2)
Q2$end_time <- as_datetime(Q2$end_time_2)
Q3$start_time <- as_datetime(Q3$start_time_2)
Q3$end_time <- as_datetime(Q3$end_time_2)
Q4$start_time <- as_datetime(Q4$start_time_2)
Q4$end_time <- as_datetime(Q4$end_time_2)
```
The R code below calculates trip duration and adds new columns for duration as well as hour, weekday, month and quarter of travel, to enable the creation of the charts below. It also renames the user types to ensure consistency before joining the data sets.
```r
Q1v2 <- Q1 %>%
  mutate(
    trip_duration = end_time - start_time,
    hours = hour(start_time),
    weekday = weekdays(start_time),
    month = month(start_time),
    quarter = quarters(start_time),
    usertype = ifelse(as.character(usertype) == "member", "Subscriber", as.character(usertype)),
    usertype = ifelse(as.character(usertype) == "casual", "Casual", as.character(usertype))
  )

Q2v2 <- Q2 %>%
  mutate(
    trip_duration = end_time - start_time,
    hours = hour(start_time),
    weekday = weekdays(start_time),
    month = month(start_time),
    quarter = quarters(start_time),
    usertype = ifelse(as.character(usertype) == "Customer", "Casual", as.character(usertype))
  )

Q3v2 <- Q3 %>%
  mutate(
    trip_duration = end_time - start_time,
    hours = hour(start_time),
    weekday = weekdays(start_time),
    month = month(start_time),
    quarter = quarters(start_time),
    usertype = ifelse(as.character(usertype) == "Customer", "Casual", as.character(usertype))
  )

Q4v2 <- Q4 %>%
  mutate(
    trip_duration = end_time - start_time,
    hours = hour(start_time),
    weekday = weekdays(start_time),
    month = month(start_time),
    quarter = quarters(start_time),
    usertype = ifelse(as.character(usertype) == "Customer", "Casual", as.character(usertype))
  )
```
Columns not present in all 4 files were also deleted prior to joining the data sets (that code is not shown here).
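The column-deletion step could be done in several ways; below is a minimal base-R sketch of one approach, keeping only the columns common to every frame before binding. The toy frames `a` and `b` (and their columns) are illustrative stand-ins, not the real quarterly data sets.

```r
# Toy frames standing in for the quarterly data sets
a <- data.frame(trip_id = 1:2, usertype = c("Casual", "Subscriber"), bike_id = 5:6)
b <- data.frame(trip_id = 3:4, usertype = c("Casual", "Casual"))

# Columns present in every frame
common_cols <- Reduce(intersect, lapply(list(a, b), names))

# Drop everything else so the frames share an identical column set
a <- a[, common_cols, drop = FALSE]
b <- b[, common_cols, drop = FALSE]

all_rows <- rbind(a, b)  # now binds cleanly
```

`rbind()` requires matching column names across frames, which is why the non-common columns have to go first.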
R code to reorder columns prior to joining the data sets

```r
Q1v2 <- Q1v2 %>% relocate(start_station_id, .before = start_station_name)
Q1v2 <- Q1v2 %>% relocate(end_station_id, .before = end_station_name)
Q2v2 <- Q2v2 %>% relocate(trip_duration, .after = end_time)
```
R code to bind/join all 4 data sets together

```r
all_qtrs <- rbind(Q1v2, Q2v2, Q3v2, Q4v2)
```
R code to delete rows with NA values (row 414427)

```r
# Clean n/a
all_qtrs <- all_qtrs[complete.cases(all_qtrs), ]
```
R code to select rows with NA (in `apply`, a margin of 1 searches by row, 2 by column)

```r
rows_with_na <- all_qtrs[apply(
  all_qtrs,
  1,
  function(x) any(is.na(x))
), ]
```
R code to remove rows with trip durations which are zero, negative or over 24 hours, and trips which began and ended at the same station

```r
all_qtrs <- all_qtrs[!(all_qtrs$trip_duration == 0), ]
all_qtrs <- all_qtrs[!(all_qtrs$trip_duration < 0), ]
all_qtrs <- all_qtrs[!(all_qtrs$trip_duration > 86400), ]
all_qtrs <- all_qtrs[!(all_qtrs$start_station_id == all_qtrs$end_station_id), ]
```
Below are some basic calculations for subscribers and casual users combined. It’s apparent that casual users make longer journeys; this will be explored in greater depth later.
On the basis of these findings I removed outliers (see the earlier step above). For example, trips with a duration over 24 hours were removed, because a day pass runs out after 24 hours (86,400 seconds). Trips which began and ended at the same station were also removed (all under 32 seconds).
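The removal criteria above can also be expressed as a single boolean filter. A base-R sketch on toy data (durations in seconds; the values are made up for illustration):

```r
# Toy trips: negative, zero-length, same-station, over-24-hour, and one valid trip
trips <- data.frame(
  trip_duration    = c(-5, 0, 120, 90000, 600),
  start_station_id = c(1, 2, 3, 4, 5),
  end_station_id   = c(9, 8, 3, 6, 7)
)

keep <- with(trips,
             trip_duration > 0 &
             trip_duration <= 86400 &             # a day pass caps at 86,400 seconds
             start_station_id != end_station_id)  # drop same-station round trips

trips_clean <- trips[keep, ]  # only the 600-second trip survives
```

Combining the conditions into one vector makes the cleaning rules easier to audit than four separate subsetting steps.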
It’s also evident that more trips are made on weekdays. We’ll review this in more depth later.
```r
kable(all_qtrs %>%
        summarise(
          Mean = mean(trip_duration, na.rm = TRUE),
          Maximum = max(trip_duration, na.rm = TRUE),
          Minimum = min(trip_duration, na.rm = TRUE),
          Standard_Deviation = sd(trip_duration, na.rm = TRUE)
        ),
      "pipe", col.names = c("Mean", "Maximum", "Minimum", "Standard Deviation"))
```
| Mean | Maximum | Minimum | Standard Deviation |
|---|---|---|---|
| 1099.371 secs | 86385 secs | 32 secs | 2014.439 |
```r
kable(all_qtrs %>%
        group_by(usertype) %>%
        summarise(
          Mean = mean(trip_duration),
          Maximum = max(trip_duration),
          Minimum = min(trip_duration),
          Standard_Deviation = sd(trip_duration)
        ),
      "pipe", col.names = c("Usertype", "Mean", "Maximum", "Minimum", "Standard Deviation"))
```
| Usertype | Mean | Maximum | Minimum | Standard Deviation |
|---|---|---|---|---|
| Casual | 2267.8891 secs | 86349 secs | 45 secs | 3506.833 |
| Subscriber | 770.2211 secs | 86385 secs | 32 secs | 1115.339 |
```r
kable(all_qtrs %>%
        group_by(weekday, usertype) %>%
        summarise(
          Average_Trip = mean(trip_duration),
          Shortest_trip = min(trip_duration),
          Longest_trip = max(trip_duration),
          Standard_Deviation = sd(trip_duration)
        ),
      "pipe", col.names = c("Weekday", "Usertype", "Average Trip", "Shortest Trip", "Longest Trip", "Standard Deviation"))

## `summarise()` has grouped output by 'weekday'. You can override using the
## `.groups` argument.
```
```r
kable(all_qtrs %>%
        count(weekday),
      "pipe", col.names = c("Weekday", "Number of Trips"))
```
| Weekday | Number of Trips |
|---|---|
| Friday | 560425 |
| Monday | 560892 |
| Saturday | 475681 |
| Sunday | 426671 |
| Thursday | 572814 |
| Tuesday | 585874 |
| Wednesday | 579337 |
I made five recommendations in total, but these tend to overlap and so could be combined and condensed into fewer.
1. A weekend only annual subscription.
2. A 6 month, mid-April to mid-October, seasonal pass.
3. A hybrid of my first two recommendations: an annual membership for weekend use during a 6-month period.
4. A discounted subscription, aimed at casual users, allowing full weekend use and weekday use outside peak (commuter) hours. As a bonus this should ensure enough cycles are always available to all users.
5. Target selected stations to engage with customers. Specifically, Streeter & Grand Avenue on any given day but focusing on Saturdays and Sundays.